MEASURING AND CORRECTING FOR INFORMATION LOSS
IN CONFIDENTIALISED CENSUS COUNTS 


Janice Wooton 
Statistical Services 


ABSTRACT 


The Australian Bureau of Statistics (ABS) has developed a new confidentiality 
protection method for census tables, to be applied for 2006 census data. The method 
differs from more traditional disclosure control methods in that there are a number of 
parameters that can be set to fine-tune the methodology. In order to determine the 
best settings for these parameters, the ABS has investigated and attempted to balance 
the benefit (level of protection or reduction of risk of identification) with the cost 
(damage done to the integrity of the table, or the information loss). This paper 
discusses a number of ways to measure information loss in tables. In particular, a
detailed examination of the $\chi^2$ test of association in a three dimensional table is
undertaken. We show that the confidentiality procedure produces a positive bias on
this statistic and on certain partitions of it. As a result of this work, we are able to
quantify the impact on the $\chi^2$ test of association due to the confidentiality protection,
and provide advice for users on how to compensate for this effect.


Keywords: Cell Perturbation, Information Loss Measures, Frequency Tables, 
Chi-Squared Test of Association, Confidentiality, Statistical Disclosure Control. 
1. INTRODUCTION


Tables of counts from the Census of Population and Housing are one of the ABS’s 
most widely used products and the ABS is under a legislative obligation to maintain 
the confidentiality of the people who provide these data. For the 2006 census tabular 
output there will be improvements in the way users can access census data and finer 
level geographical building blocks called mesh blocks will be made available. Mesh 
blocks and improved access modes give users more flexibility and more scope to build 
and define their own tables. The method of confidentialising census tables used prior 
to 2006 is no longer adequate in this new environment because the disclosure risk, 
through table differencing, is too high (for further details see Wooton and Fraser, 
2005). The old method adjusts small cells only and tables that differ slightly in their 
definition could be differenced from one another to obtain detailed unconfidentialised 
and potentially identifying small subpopulation data. 


To solve this problem the ABS has recently developed a new cell perturbation 
confidentiality methodology to be applied to all census tables of counts before release. 
This new method protects against disclosures occurring through table differencing 
because small noise terms are added to all cells and not just to the smaller cells. 
Therefore under the new method, a user who differences two tables and obtains small
cell counts cannot be certain of the original values behind those differences, which
protects small cell counts from being revealed with certainty.


There are various parameters associated with this new cell perturbation method. The 
parameter values ultimately control both the amount of information loss and the 
identification risk in a perturbed table. Parameters need to be chosen to give a good
compromise between minimising these two conflicting attributes. Before deciding on
the parameter values it is necessary to measure both
information loss and identification risk in the perturbed tables.


The main focus of this paper will be on how to measure the information loss in 
perturbed census tables of counts. In Section 2 we briefly describe the new cell 
perturbation methodology and the parameters that need to be chosen. A Monte Carlo 
study is then undertaken in Sections 3 to 6 which examines: (i) the perturbation 
distributions and the distortion to the original cell counts; (ii) the variance and 
covariance structure of the perturbations within tables; and (iii) the impact of the 
perturbations on contingency table analyses and tests. From this empirical 
investigation we obtain many useful insights into both information loss and 
identification risk. We then discuss information loss measures in more detail in 
Section 7 and how to adjust analyses to correct for the effects of perturbation. These 
results will also help determine a good choice of perturbation parameter values. 
2. BRIEF DESCRIPTION OF THE CELL PERTURBATION METHODOLOGY


Perturbations are added to all cells in a two stage process, described in detail below.
For the $i$-th cell in a table we have (future references drop the $i$ subscript):


\[
\begin{aligned}
A_i &= U_i + (P_i - U_i) + (A_i - P_i) \\
    &= U_i + e_{P(i)} + e_{A(i)}
\end{aligned}
\tag{1}
\]


where $A_i$ is the additively perturbed cell count that the ABS will be publishing, $U_i$ is
the original (unconfidentialised) cell count and $P_i$ is the consistently perturbed cell
count. There are two random noise terms, $e_{P(i)}$ and $e_{A(i)}$, which are, respectively, the
discrete stage 1 and stage 2 perturbations. We will now describe how these are
generated and why a two stage process is needed.


Before any tables are perturbed or even produced we first assign to each unit on the
microdata file a permanent independent discrete random uniform number on the
interval $[0, N-1]$, where $N$ is a large positive integer. These are called Rkeys and are
used to generate consistent values of $e_P$. That is, whenever the same group of units
are in a cell, the same value of $e_P$ is always generated. This is achieved through a
function that maps the contributing Rkeys to a new integer value in the interval
$[0, N-1]$. This new value is referred to as the Ckey of the cell. Addition modulo $N$ is
one example of a suitable function. Each Ckey and $U$ value are then both mapped to a
probability distribution, guaranteeing that the same random perturbation $e_P$ is always
applied whenever the same set of contributors are in a cell.
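

The consistency mechanism described above can be sketched as follows. This is a minimal illustration and not the ABS implementation: the value of `N`, the function names `assign_rkeys`, `ckey` and `perturbation`, the inverse-CDF lookup used to turn a Ckey into a perturbation, and the example distribution are all assumptions made for the sketch. Only the use of addition modulo N comes from the text.

```python
import random

N = 2**31 - 1  # hypothetical choice of the "large positive integer" N


def assign_rkeys(unit_ids, seed=0):
    """Give every unit a permanent uniform random integer in [0, N-1]."""
    rng = random.Random(seed)
    return {unit: rng.randrange(N) for unit in unit_ids}


def ckey(cell_units, rkeys):
    """Combine the contributors' Rkeys by addition modulo N."""
    return sum(rkeys[unit] for unit in cell_units) % N


def perturbation(ckey_value, dist):
    """Map a Ckey to a perturbation by inverse-CDF lookup (an assumption).

    `dist` is {k: P(e_P = k | U)} for the cell's original count U.
    """
    u = ckey_value / N  # treated as a uniform number in [0, 1)
    cumulative = 0.0
    for k in sorted(dist):
        cumulative += dist[k]
        if u < cumulative:
            return k
    return max(dist)  # guard against floating point round-off


rkeys = assign_rkeys(range(100), seed=1)
cell = [3, 17, 42]
example_dist = {-1: 0.25, 0: 0.5, 1: 0.25}  # illustrative distribution only
e_p = perturbation(ckey(cell, rkeys), example_dist)
```

Because the Ckey depends only on which units contribute to the cell, requesting the same cell in a different table reproduces the same value of `e_p`.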


The distribution for $e_P$ is chosen to balance measures of both information loss and
identification risk. Identification risk is related to uncertainty. The more uncertain we
are about an outcome, the smaller the identification risk. Information entropy is a
measure of the uncertainty of an outcome (see chapter 11 of Jaynes (2003) for further
details). Conditional on $U$, an appropriate distribution of $e_P$ can be obtained by
maximising the entropy (uncertainty) subject to some information loss constraints.
That is, for each $U$ we maximise the function


\[
-\sum_{k} P(e_P = k \mid U) \log P(e_P = k \mid U)
\]


subject to the following constraints:
1. $\sum_{k} P(e_P = k \mid U) = 1$ and $P(e_P = k \mid U) \ge 0$ for all $k$.


2. $E(e_P \mid U) = 0$ and $\mathrm{Var}(e_P \mid U) = c_U$, where $c_U$ is a non-negative constant that needs
to be set.


3. $U + e_P \ge 0$.
4. Given $U$, $e_P \in \{-d_U, -d_U + 1, \ldots, -1, 0, 1, \ldots, d_U - 1, d_U\}$, excluding values in this
integer range where 3. above is not satisfied. $d_U$ is a non-negative integer
constant that needs to be set.


The parameter values $c_U$ and $d_U$ need to be chosen before any of the $e_P$ are
generated. By adjusting these we have some control over both information loss and
identification risk.


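There is no closed form solution to this problem, but it can be written in terms of
Lagrange multipliers and then solved numerically. The solution is


\[
P(e_P = k \mid U) = e^{\lambda_U + \alpha_U k + \beta_U k^2}
\]


where $\lambda_U$, $\alpha_U$ and $\beta_U$ are chosen to satisfy


\[
\begin{aligned}
\sum_{k} P(e_P = k \mid U) &= 1 \\
\sum_{k} P(e_P = k \mid U)\, k &= 0 \\
\sum_{k} P(e_P = k \mid U)\, k^2 &= c_U
\end{aligned}
\]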


An $e_P$ value is generated independently in every cell of the table, including marginal
and grand total cells. The table defined by the set of $U + e_P$ values is not additive in
general (by additive we mean additive relationships such as row totals adding to the
grand total). To restore additivity to the table, we add the second stage
perturbation $e_A$ to each cell. Ideally we would like the set of $e_A$ terms for a table to all
be as close to 0 as possible to ensure some consistency for the same cells in different
tables. To generate the set of $e_A$ for a given table, we use an iterative fitting algorithm
developed by the ABS. This algorithm attempts to balance and minimise squared
distances to the set of $P = U + e_P$ values for a given table, subject to all the additive
relationships being maintained between cells. $e_A = 0$ is always guaranteed in grand
total cells to ensure consistency of grand totals across tables.


As we have seen, the perturbation process adds random noise values $e_T = e_P + e_A$ to all
cells in a table. This means that original cell values and original cell proportions will
be distorted under perturbation, leading to information loss. In order to determine
the amount of distortion it is necessary to examine the distributional properties
of the $e_T$ terms. This examination can only be done empirically via simulation because
it is not immediately clear what the exact distributions of the $e_T$ terms will be. In
addition it is expected that these distributions will depend on factors such as the
number of additivity constraints in a table, the dimension of the table, the number of
categories within a dimension, sparsity and the total sample size in a table. A Monte
Carlo study is therefore undertaken and the results are discussed in the next section.
3. EXAMINATION OF THE DISTRIBUTIONAL PROPERTIES OF $e_T$


Information loss is something that needs to be examined from many perspectives.
There is no one measure which can sufficiently cover everything we want. However
we can assert that when information loss is small, $\mathrm{Var}(e_T)$ should be small,
$E(e_T) = 0$ (to avoid bias) and $P(|e_T| > d_U)$ should be very small. The shape of the
distribution will also play a role and needs to be considered. It is therefore useful to
look at these attributes.


Before we simulate any perturbed tables we need to choose the set of perturbation
parameters $\{c_U, d_U\}$. For the simulation study these will be $c_0 = 0$, $d_0 = 0$; $c_1 = 2.5$,
$d_1 = 3$; $c_2 = 4$, $d_2 = 3$; and for $U \ge 3$, $c_U = 4$, $d_U = 5$. These were thought to be reasonable
initial choices that would give a good compromise between information loss
and disclosure risk.


The Monte Carlo study is restricted to the examination of three dimensional weekly 
individual income by age group by sex tables from the 2001 census. These are the 
largest tables in terms of the number of cells in the 2001 Basic Community Profile 
series published by the ABS. Two different 2001 SLA (Statistical Local Area) 
geographies are chosen for the analysis, call these SLA 1 and SLA 2. The SLA 1 table 
has an average original cell count of 4.7 (it contains about 1500 people) and SLA 2 has 
an average original cell count of 0.4 (it contains about 100 people). SLA 2 is very 
sparse and all but one of its interior cells has an original count of 0, 1 or 2. 


For each of the two SLAs we simulate 10,000 independent sets of additively perturbed
tables. This is done by simulating 10,000 sets of Rkeys for each SLA and then, for each
set of Rkeys, calculating the additively perturbed table using the steps outlined in
Section 2 to obtain 10,000 sets of $U + e_P + e_A$ values for each SLA.


For the analysis we group various cells together. The distribution of $e_P$ given $U$ is
the same for all $U > 8$ and is symmetric and bell shaped. Therefore as $U$ gets sufficiently
far from 0 we expect $e_A$, and hence $e_T$, to behave the same as well. We group cells with
$U > 11$ together and examine the rest individually. There are also various different
cell types in a table, which are defined by whether the cell is a particular marginal total,
subtotal, grand total or interior cell. For our three dimensional cross classified table
there are eight different cell types. These cell types can be obtained by summing over
particular combinations of dimensions in the table, and cells will be grouped according
to the cell type.
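

The count of eight cell types follows from choosing, for each of the three dimensions, whether or not to sum over it. The short sketch below enumerates them; the dimension labels and the function name `cell_types` are illustrative only.

```python
from itertools import combinations

DIMENSIONS = ("income", "age", "sex")


def cell_types(dimensions=DIMENSIONS):
    """Return every subset of dimensions that a cell can be summed over."""
    return [subset
            for size in range(len(dimensions) + 1)
            for subset in combinations(dimensions, size)]


for summed in cell_types():
    kept = [name for name in DIMENSIONS if name not in summed]
    label = " by ".join(kept) if kept else "grand total"
    print(f"sum over {summed or '()'} -> {label}")
```

The empty subset corresponds to interior cells, and summing over all three dimensions gives the grand total.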


Figure 3.1 compares the $e_P$ and $e_T$ distributions for interior cells with $U > 11$ for SLA 1.
$P(|e_T| > 5)$ is very small and the two distributions are similar in shape, with $e_T$
being the more peaked of the two. In general, when examining figure 3.1 and other
graphs (not given here) for the different cell types and $U$ values, the $e_T$ distributions
given $U$ appear to compromise well between information loss and identification risk.
3.1 Comparison of $e_P$ and $e_T$ distributions for the interior cells of SLA 1 (for $U > 11$)


[Figure not reproduced. Vertical axis: probability (0 to 0.3); horizontal axis: perturbation value (-7 to 7).]


There is always sufficient uncertainty in outcomes and mostly only small noise terms
are being added with high probability. Therefore on average the cell count distortions
will not be too large.


An estimate of $E(e_T)$ for each cell was calculated based on the 10,000 sets of $e_T$ values.
Expectations were all found to be less than 0.5 in magnitude and mostly very close to
0. Therefore $E(e_T) \approx 0$ is a fair conclusion and in general the perturbation process is
approximately unbiased. An examination of the estimated cell variances was also
undertaken. These were in general found not to be too large.


Figure 3.2 contains boxplots of the distributions of estimated variances of $e_T$ within
each cell type for $U > 11$ for SLA 1. Notice that the variances of $e_T$ are in general
smallest for the interior cells of the table and largest for certain marginal
subtotals. More noise is therefore being introduced to the marginals, although in
general the noise is smaller as a proportion of the cell frequency $U$ than is the case for
interior cells. The noise in the marginals is also small relative to the noise introduced
by some alternative methods. Suppose we had instead applied uncontrolled random
rounding to base 3 to cell counts less than 3. Then the variances in the sex
subtotal cells would have been about 55.
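

As a rough check on the figure of 55, the following sketch assumes unbiased random rounding to base 3 and assumes that each subtotal is formed by adding independently rounded interior cells; neither assumption is spelled out in the text. A count of 1 is rounded to 0 with probability 2/3 and to 3 with probability 1/3, and a count of 2 is rounded to 3 with probability 2/3 and to 0 with probability 1/3. In both cases the rounding error has mean 0 and variance

\[
\tfrac{2}{3}(1)^2 + \tfrac{1}{3}(2)^2 = 2,
\]

so a subtotal built from $m$ interior cells equal to 1 or 2 has rounding variance $2m$. Under these assumptions a variance of about 55 would correspond to roughly 27 or 28 such cells contributing to each sex subtotal.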
3.2 Distribution of $e_T$ cell variances within each cell type


[Figure not reproduced. Boxplots of estimated variance (vertical axis) by cell type (horizontal axis).]


We do have to be careful that the variances are not too small, because then the
identification risk may be too large. We also examined correlations between $e_P$ and
$e_A$ within each cell and found that for $U > 11$ for SLA 1, the interior cells had the
largest negative correlations (approximately $-0.65$). This may not be ideal from an
identification risk perspective, because we know the value of $U + e_P + e_A$: it is
the published interior cell count. It is relatively easy to derive $U + e_P$,
because we could just request another table with the interior cell count as the grand
total. $e_A = 0$ is always guaranteed for grand totals and therefore we can derive $e_A$ for
the interior cell via differencing. So given $e_A$, can we predict $e_P$ with good precision
and hence $U$? If the answer is yes we have a high identification risk. We examined
plots of $e_P$ versus $e_A$ and determined that $e_P$ cannot be predicted with good precision
for any value of $e_A$. The effect is that in general, given $e_A$, we are roughly halving the
variance of $e_P$.
4. ANALYSIS OF CONTINGENCY TABLES


So far we have only looked at the distributional properties of the perturbations. Now 
we will determine how perturbations affect contingency table analyses and tests. 
When analysing contingency tables, we are often interested in answering such 
questions as: 


1. Is there an association between certain categorical variables? 


2. What is the nature (or direction) of the association (assuming 1. holds)? For 
example, can we conclude that income increases with age? 


Pearson’s $\chi^2$ test and log-linear model analyses can be used to address the above
questions. As Beh and Davy (2004) suggest, from the analysis of data using log-linear
models, the researcher can determine important associations that exist in the data.
However, as Beh and Davy (2004) go on to state, there are some problems with this
method. One is that the selection of an optimal log-linear model requires a trial and
error approach of fitting and refitting, which could lead to computing a large number
of models. This is not ideal for our situation, where we would like to analyse
thousands of simulated tables. Another is that the conventional method of
estimating parameters is an iterative maximum likelihood technique such as
Newton-Raphson, which is computationally intensive when applied to thousands of
tables. In addition, the Newton-Raphson procedure sometimes fails to converge to a
solution.


Applying Pearson’s $\chi^2$ test has a distinct advantage over the use of log-linear models
for addressing question 1. It is computationally easy to calculate, does not involve
using iterative methods, can easily be applied to thousands of tables and only needs to
be applied once irrespective of the relationship. However, Pearson’s $\chi^2$ statistic
cannot be used on its own to address question 2. It does not give us any information
about the nature of the relationship between variables if there is one. But if at least
one of the categorical variables is ordinal, then we can partition Pearson’s $\chi^2$ statistic
using orthogonal polynomials as outlined in Rayner and Best (2001), Beh and Davy
(1999) and Beh and Davy (1998) and undertake more specific directional testing to
address question 2.


Beh and Davy (1999) describe a partition which can be used on doubly ordered three 
way tables. Using this partition, information about the relationship between the 
variables can be obtained by identifying important associations in terms of the location 
(linear), dispersion (quadratic) and higher order components. The directions of the 
associations can also be determined. We will apply this methodology to our income 
by age by sex tables and use it to determine how associations change after 
perturbation is applied. Some advantages of this method are that model selection is
not necessary, calculating the partitions does not involve iteration and it can easily be
applied to thousands of simulated tables. Interestingly, components of the partitions
can also be used to directly estimate parameters in ordinal log-linear models (without
iteration). See Beh and Davy (2004) and Beh and Farver (2006) for more details.
5. BRIEF DESCRIPTION OF THE BEH AND DAVY (1999) PARTITIONS
OF PEARSON’S $\chi^2$ STATISTIC METHODOLOGY


We have a three way table with a grand total $n$. The table has $I$ rows ($i = 1, 2, \ldots, I$),
$J$ columns ($j = 1, 2, \ldots, J$) and $K$ tubes ($k = 1, 2, \ldots, K$), and the $(i, j, k)$th cell relative
frequency is $p_{ijk} = n_{ijk}/n$, where $n_{ijk}$ is the $(i, j, k)$th cell count.


It is assumed that the rows and columns are ordered and the tubes are not. For our
table we take income categories as the rows, age categories as the columns and sex as
the tubes. A dot on a subscript indicates summation over that dimension.


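Pearson’s $\chi^2$ statistic can be partitioned as


\[
\begin{aligned}
\chi^2 &= \sum_{u=1}^{I-1}\sum_{v=1}^{J-1}\sum_{k=1}^{K} Y_{uvk}^2
        + \sum_{u=1}^{I-1}\sum_{k=1}^{K} Y_{u0k}^2
        + \sum_{v=1}^{J-1}\sum_{k=1}^{K} Y_{0vk}^2 \\
       &= \chi^2_{IJ(K)} + \chi^2_{I(K)} + \chi^2_{J(K)}
\end{aligned}
\tag{2}
\]

where


\[
Y_{uvk} = \sqrt{n}\,\sum_{i=1}^{I}\sum_{j=1}^{J} a_u(i)\, b_v(j)\, \frac{p_{ijk}}{\sqrt{p_{\cdot\cdot k}}}.
\tag{3}
\]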

The set $\{a_u(i)\}$ are orthogonal polynomials on $\{p_{i\cdot\cdot}\}$ and the set $\{b_v(j)\}$ are
orthogonal polynomials on $\{p_{\cdot j\cdot}\}$. These can be generated using the formulae given
on page 70 of Beh and Davy (2004) and we use natural scores. The three $\chi^2$ partitions
in (2) each have asymptotic independent $\chi^2$ distributions under the null hypothesis of
independence, with degrees of freedom $(I-1)(J-1)K$, $(I-1)(K-1)$ and $(J-1)(K-1)$
respectively. The $Y_{uvk}$ values are asymptotically normal with mean 0 under the null
hypothesis of independence and can be used to detect any associations on a category
level. According to Beh and Davy (1999), $Y_{uvk}$ (for $u > 0$ and $v > 0$) describes the
effect the $(u, v)$th bivariate moment has on the $k$-th non-ordered tube category, $Y_{u0k}$
describes how the $u$th univariate moment of the rows affects the $k$-th non-ordered
tube category and $Y_{0vk}$ describes how the $v$th univariate moment of the columns
affects the $k$-th non-ordered tube category.
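

The components in (3) can be computed directly, as the sketch below illustrates on a small made-up table. Several choices here are assumptions rather than the source's method: the orthogonal polynomials are built by modified Gram-Schmidt on the natural scores instead of the recurrence formulae cited from Beh and Davy (2004), all marginal proportions are assumed positive, and the function names and the example counts are hypothetical. For tables with many ordered categories, Gram-Schmidt on raw powers can lose accuracy.

```python
import math


def orthonormal_polys(weights):
    """polys[u][i]: order-u polynomial in the natural scores 1..m,
    orthonormal with respect to `weights` (all assumed positive)."""
    m = len(weights)
    polys = []
    for degree in range(m):
        p = [float(x) ** degree for x in range(1, m + 1)]
        for q in polys:  # remove the part explained by lower-order polynomials
            proj = sum(w * pi * qi for w, pi, qi in zip(weights, p, q))
            p = [pi - proj * qi for pi, qi in zip(p, q)]
        norm = math.sqrt(sum(w * pi * pi for w, pi in zip(weights, p)))
        polys.append([pi / norm for pi in p])
    return polys


def margins(table):
    """Return n and the one-way marginal proportions of table[i][j][k]."""
    I, J, K = len(table), len(table[0]), len(table[0][0])
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    n = sum(table[i][j][k] for i, j, k in cells)
    p_row = [sum(table[i][j][k] for i, j, k in cells if i == r) / n for r in range(I)]
    p_col = [sum(table[i][j][k] for i, j, k in cells if j == c) / n for c in range(J)]
    p_tube = [sum(table[i][j][k] for i, j, k in cells if k == t) / n for t in range(K)]
    return n, p_row, p_col, p_tube


def y_components(table):
    """Return {(u, v, k): Y_uvk} as defined in equation (3)."""
    I, J, K = len(table), len(table[0]), len(table[0][0])
    n, p_row, p_col, p_tube = margins(table)
    a, b = orthonormal_polys(p_row), orthonormal_polys(p_col)
    Y = {}
    for u in range(I):
        for v in range(J):
            for k in range(K):
                s = sum(a[u][i] * b[v][j] * table[i][j][k] / n
                        for i in range(I) for j in range(J))
                Y[u, v, k] = math.sqrt(n) * s / math.sqrt(p_tube[k])
    return Y


def pearson_chi2(table):
    """Pearson's statistic for complete independence of the three variables."""
    I, J, K = len(table), len(table[0]), len(table[0][0])
    n, p_row, p_col, p_tube = margins(table)
    total = 0.0
    for i in range(I):
        for j in range(J):
            for k in range(K):
                expected = n * p_row[i] * p_col[j] * p_tube[k]
                total += (table[i][j][k] - expected) ** 2 / expected
    return total


# Hypothetical 3 x 3 x 2 table of counts, for illustration only.
table = [[[10, 6], [8, 7], [3, 5]],
         [[6, 9], [12, 10], [7, 8]],
         [[2, 4], [9, 11], [14, 9]]]
Y = y_components(table)
chi2_IJ = sum(y * y for (u, v, k), y in Y.items() if u > 0 and v > 0)
chi2_I = sum(y * y for (u, v, k), y in Y.items() if u > 0 and v == 0)
chi2_J = sum(y * y for (u, v, k), y in Y.items() if u == 0 and v > 0)
```

The three sums reproduce the decomposition in (2): added together they equal Pearson's statistic computed directly from observed and expected counts.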


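The $\chi^2$ partitions in (2) can also be broken down into further partitions as


\[
\chi^2_{IJ(K)} = \sum_{v=1}^{J-1}\sum_{k=1}^{K} Y_{1vk}^2 + \sum_{v=1}^{J-1}\sum_{k=1}^{K} Y_{2vk}^2 + \cdots + \sum_{v=1}^{J-1}\sum_{k=1}^{K} Y_{rvk}^2 + \sum_{u=r+1}^{I-1}\sum_{v=1}^{J-1}\sum_{k=1}^{K} Y_{uvk}^2
\tag{4}
\]

\[
\chi^2_{IJ(K)} = \sum_{u=1}^{I-1}\sum_{k=1}^{K} Y_{u1k}^2 + \sum_{u=1}^{I-1}\sum_{k=1}^{K} Y_{u2k}^2 + \cdots + \sum_{u=1}^{I-1}\sum_{k=1}^{K} Y_{urk}^2 + \sum_{u=1}^{I-1}\sum_{v=r+1}^{J-1}\sum_{k=1}^{K} Y_{uvk}^2
\tag{5}
\]

\[
\chi^2_{I(K)} = \sum_{k=1}^{K} Y_{10k}^2 + \sum_{k=1}^{K} Y_{20k}^2 + \cdots + \sum_{k=1}^{K} Y_{r0k}^2 + \sum_{u=r+1}^{I-1}\sum_{k=1}^{K} Y_{u0k}^2
\tag{6}
\]

\[
\chi^2_{J(K)} = \sum_{k=1}^{K} Y_{01k}^2 + \sum_{k=1}^{K} Y_{02k}^2 + \cdots + \sum_{k=1}^{K} Y_{0rk}^2 + \sum_{v=r+1}^{J-1}\sum_{k=1}^{K} Y_{0vk}^2
\tag{7}
\]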
Each of the $r+1$ partitions on the right hand side of equations (4), (5), (6) and (7)
above follows a $\chi^2$ distribution asymptotically under the null hypothesis of
independence. These can be used for specific directional tests. Common choices
of $r$ are 2 or 4. As Rayner and Best (2001) suggest, we are usually only
specifically interested in terms relating to the first four moments at most (often two
are enough to describe a relationship). A reasonable approach to analysing a
contingency table would be to calculate all $r+1$ terms on the right hand side of
equations (4) to (7) and do formal tests of significance of these. Once significance is
established we could informally look at individual $Y_{uvk}$ values to determine the
direction of any associations. This approach is undertaken by both Rayner and Best
(2001) and Beh and Davy (1999) and we will do so too.
6. APPLYING PEARSON’S $\chi^2$ PARTITIONS TO THE ORIGINAL AND THE
ADDITIVELY PERTURBED SIMULATED TABLES


We have doubly ordered three way tables with $i = 1, 2, \ldots, 14$ ordinal income
categories (in order of increasing income), $j = 1, 2, \ldots, 8$ ordinal age categories (in
order of increasing age) and $k = 1, 2$ non-ordinal sex categories (1 = male and
2 = female). The analysis is restricted to positive stated income.


SLA 2 is very sparse, which means that the asymptotic $\chi^2$ null distribution may not
hold. We calculate Monte Carlo p-values for both SLA tables. This is done by
conditioning on the income subtotals, sex subtotals and age subtotals as suggested in
Mehta and Patel (1997) and then simulating 10,000 tables under the null hypothesis of
complete independence using the algorithm in Agresti et al. (1979). For each of these
10,000 tables we calculate all the $\chi^2$ partitions, whose distributions can then be used to
calculate p-values. These distributions will be denoted by ‘independence’ in future
references.
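

A Monte Carlo p-value of this kind can be sketched as follows. The sampler below is a plain label-permutation scheme, which draws tables with the observed one-way margins under complete independence; it is a stand-in and not the Agresti et al. (1979) algorithm. The function names, the `(exceed + 1) / (n_sim + 1)` estimator and the example table are likewise assumptions for illustration.

```python
import random


def expand(table):
    """List one (i, j, k) record for every unit counted in table[i][j][k]."""
    return [(i, j, k)
            for i, row in enumerate(table)
            for j, tube_counts in enumerate(row)
            for k, count in enumerate(tube_counts)
            for _ in range(count)]


def pearson_chi2(table):
    """Pearson's statistic for complete independence (zero margins skipped)."""
    I, J, K = len(table), len(table[0]), len(table[0][0])
    n = sum(x for row in table for tube_counts in row for x in tube_counts)
    rows = [sum(sum(t) for t in table[i]) for i in range(I)]
    cols = [sum(sum(table[i][j]) for i in range(I)) for j in range(J)]
    tubes = [sum(table[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    stat = 0.0
    for i in range(I):
        for j in range(J):
            for k in range(K):
                expected = rows[i] * cols[j] * tubes[k] / n ** 2
                if expected > 0:
                    stat += (table[i][j][k] - expected) ** 2 / expected
    return stat


def simulate_null_table(records, shape, rng):
    """Shuffle column and tube labels independently of the row labels."""
    I, J, K = shape
    rows = [r[0] for r in records]
    cols = [r[1] for r in records]
    tubes = [r[2] for r in records]
    rng.shuffle(cols)
    rng.shuffle(tubes)
    table = [[[0] * K for _ in range(J)] for _ in range(I)]
    for i, j, k in zip(rows, cols, tubes):
        table[i][j][k] += 1
    return table


def monte_carlo_p_value(table, statistic=pearson_chi2, n_sim=1000, seed=0):
    """Share of simulated null tables whose statistic reaches the observed one."""
    shape = (len(table), len(table[0]), len(table[0][0]))
    records = expand(table)
    rng = random.Random(seed)
    observed = statistic(table)
    exceed = sum(statistic(simulate_null_table(records, shape, rng)) >= observed
                 for _ in range(n_sim))
    return (exceed + 1) / (n_sim + 1)


# Hypothetical, strongly associated 2 x 2 x 2 table.
example = [[[30, 0], [0, 0]], [[0, 0], [0, 30]]]
p_value = monte_carlo_p_value(example, n_sim=200, seed=0)
```

Any of the partitions could be passed as `statistic` in place of Pearson's statistic, which is how a reference distribution for each partition would be obtained.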


SLA 2 also contains zeros in some of the marginal totals. We add a small value to each
cell before calculating any partition. Justification for this can be established through a
Bayesian argument (see page 607 of Agresti, 2002). All that is needed is a prior guess
of the cell probabilities. We use the Australia level table as our prior guess of the cell
probabilities. A constant multiplied by the prior cell probability of the $i$-th cell is then
added to the $i$-th cell count before the $\chi^2$ statistics are calculated. The constant
chosen is 0.1.
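

The adjustment amounts to one line of arithmetic per cell. In the sketch below the function name `smooth_counts` and the example counts and prior probabilities are hypothetical; only the constant 0.1 is taken from the text.

```python
def smooth_counts(counts, prior_probs, constant=0.1):
    """Add constant * prior cell probability to each cell count."""
    if len(counts) != len(prior_probs):
        raise ValueError("counts and prior_probs must have the same length")
    return [c + constant * q for c, q in zip(counts, prior_probs)]


sparse_counts = [0, 2, 0, 1]            # hypothetical sparse cell counts
prior_probs = [0.4, 0.3, 0.2, 0.1]      # hypothetical prior cell probabilities
smoothed = smooth_counts(sparse_counts, prior_probs)
```

Every smoothed cell is strictly positive whenever the prior probability is positive, so marginal totals of zero no longer arise, while the table total grows only by the constant.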


We now apply the methodology outlined in Section 5 to both of the original SLA tables
with $r = 4$ ($r$ is defined in Section 5). For SLA 2, Pearson’s $\chi^2$ statistic is not significant
(p-value = 0.5156). For SLA 1, Pearson’s $\chi^2$ is highly significant (p-value < 0.00001),
suggesting that there is an association between income, age and sex. All $\chi^2$ partitions
(4) to (7) for SLA 1 are significant overall except for (7). This implies that when
income is ignored there is little evidence of an association between age and sex, but
there is an association when comparing other dimensions. The three largest $Y_{uvk}$
values in terms of their magnitude are $Y_{121} = -6.3$, $Y_{102} = -6.3$ and $Y_{101} = 5.9$. $Y_{121}$
describes the linear by quadratic association between income and age for males and it
is negative. $Y_{102}$ and $Y_{101}$ are the two income location terms (ignoring age). The values
of these suggest that females tend to have smaller incomes than males. There are
other significant terms, but we will not give detailed interpretations here.


In any case, the question we are interested in is to what extent these conclusions
change after perturbation. The $Y_{uvk}$ values are important because they
determine the magnitude and direction of an association. We calculated all the $Y_{uvk}$
for the simulated additively perturbed tables and a typical distribution is summarised
in figure 6.1.
6.1 Distributions of $Y_{121}$ for SLA 1


[Figure not reproduced. Boxplots labelled ‘Base 3’, ‘Independence’, ‘Original value’ and ‘Additively perturbed’.]


Figure 6.1 shows boxplots of the distributions for component $Y_{121}$ in SLA 1. The
‘independence’ distribution is the distribution of $Y_{121}$ under the null hypothesis of
independence. ‘Original value’ is the value of $Y_{121}$ for the original unperturbed table.
‘Base 3’ is the distribution of $Y_{121}$ obtained under an alternative confidentiality method
which applies uncontrolled random rounding to base 3 of the 1’s and 2’s. Base 3 is
included for comparative purposes. ‘Additively perturbed’ is the
distribution of $Y_{121}$ under perturbation.


We generated many similar plots for the other components as well and a consistent
pattern emerged. The distribution of $Y_{uvk}$ under perturbation (and Base 3) is roughly
centred around the original $Y_{uvk}$ value. So under perturbation we are adding noise to
each component. Denote the noise by $e_{uvk}$; this noise term follows a roughly
symmetric distribution with mean 0. That is, $Y^*_{uvk} = Y_{uvk} + e_{uvk}$, where $Y^*_{uvk}$ is the $Y_{uvk}$
component we obtain under perturbation. To obtain Pearson’s $\chi^2$ value and the other
partitions described in equations (4) to (7) for the original table, we add the appropriate
$Y_{uvk}$ sums of squares together and calculate


\[
\text{Partition} = \sum_{m=1}^{M} Y_m^2
\tag{8}
\]

where $m$ denotes a particular combination of $uvk$ values and $M$ is the total number of
squared components we add together.


6.2 Distributions of Pearson’s $\chi^2$ statistic for SLA 1


[Figure not reproduced.]


After perturbation is applied to a table we only have the $Y^*_m$ values available and so the
partitions are calculated using


\[
\begin{aligned}
\text{Partition}^* &= \sum_{m=1}^{M} {Y^*_m}^2 \\
                   &= \sum_{m=1}^{M} \left( Y_m^2 + 2 Y_m e_m + e_m^2 \right)
\end{aligned}
\tag{9}
\]


If we assume that $E(e_m) = 0$, which is a reasonable assumption as noted above, then
the expected partition under perturbation is


\[
\begin{aligned}
E(\text{Partition}^*) &= \sum_{m=1}^{M} \left( Y_m^2 + 2 Y_m E(e_m) + E(e_m^2) \right) \\
                      &= \sum_{m=1}^{M} \left( Y_m^2 + \mathrm{Var}(e_m) \right) \\
                      &= \text{Partition} + \sum_{m=1}^{M} \mathrm{Var}(e_m)
\end{aligned}
\tag{10}
\]


6.3 Distributions of Pearson’s $\chi^2$ statistic for SLA 2


[Figure not reproduced.]
This implies that in general under perturbation, there will be an upward bias on each
partition including Pearson’s $\chi^2$ statistic. The implication of this is that in general the
p-values of the partitions will be smaller than for the original table, potentially giving a
false impression of significance.
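

The upward bias can be checked with a small simulation. The sketch below assumes independent Gaussian noise with a common standard deviation on each component, which is a simplification: the text only describes the noise as roughly symmetric with mean 0. The function name, the noise level and the choice of three components are illustrative.

```python
import random


def mean_perturbed_partition(components, noise_sd, n_sim=20000, seed=0):
    """Average of sum((Y_m + e_m)^2) over simulated zero-mean noise draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        total += sum((y + rng.gauss(0.0, noise_sd)) ** 2 for y in components)
    return total / n_sim


components = [-6.3, -6.3, 5.9]   # three component values used as an example
noise_sd = 1.0                   # assumed noise standard deviation
original = sum(y * y for y in components)
expected = original + len(components) * noise_sd ** 2
simulated = mean_perturbed_partition(components, noise_sd)
```

The simulated average sits close to `expected`, that is, the original partition plus the sum of the noise variances, rather than close to `original`.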


Figures 6.2 and 6.3 contain boxplots of the distributions of Pearson’s $\chi^2$ statistics
under perturbation for SLA 1 and SLA 2. These graphs clearly show an upward bias in
the statistics, as expected, under both perturbation and random rounding to base 3 of
the 1’s and 2’s. For SLA 2, the bias is much larger and the $\chi^2$ statistics are more
variable. For SLA 2, most of the time we would conclude that there is an association
when the original value suggested otherwise.
7. INFORMATION LOSS MEASURES


So far we have examined the distributional properties of the cell perturbations and 
how perturbation affects contingency table analyses. Perturbation leads to some
distortion of the original cell counts and introduces an upward bias to Pearson’s $\chi^2$
statistic. Ideally users would like to be able to adjust their analyses to account for this
damage. It is therefore of interest to develop a set of information loss measures that 
could potentially be published, informing users of the impact of the confidentiality 
procedure. One useful reference on this subject is Shlomo and Young (2005). We 
take a similar approach to these authors and divide the information loss measures 
according to the statistical aspect to be measured. 


We saw in Section 6 that perturbation leads to an upward bias in $\chi^2$ statistics.
Although we used a specific $\chi^2$ decomposition in Section 6, relevant for a doubly
ordered three way table, we still expect that in general there will be an upward bias in
$\chi^2$ statistics in other types of tables as well. One measure of information loss could
focus on this bias. That is, calculate

\[
E(\chi^{2*}) - \chi^2 = \sum_{m=1}^{K(IJ-1)} \mathrm{Var}(e_m),
\tag{11}
\]


where $\chi^{2*}$ is the value of $\chi^2$ under perturbation and $e_m$ is as defined in Section 6. We
could then calculate the average percentage increase in $\chi^2$ as

\[
\frac{\sum_{m=1}^{K(IJ-1)} \mathrm{Var}(e_m)}{\chi^2} \times 100\%,
\tag{12}
\]

and (12) can be estimated using

\[
\frac{\sum_{m=1}^{K(IJ-1)} \mathrm{Var}(e_m)}{\chi^{2*} - \sum_{m=1}^{K(IJ-1)} \mathrm{Var}(e_m)} \times 100\%,
\tag{13}
\]


assuming $\sum_{m=1}^{K(IJ-1)} \mathrm{Var}(e_m)$ is known (or replaced with an estimate) and the
denominator is positive.


The measure defined at (13) gives users an idea of the average amount of information
loss, in percentage terms, inherent in the $\chi^2$ test of association for a given perturbed
table. The variances of the $e_m$ terms will depend on certain table attributes, such as
the number of small cells, the number of categories, dimensions and the number of
additivity constraints. It may be possible to build up a model to predict these
variances given certain table properties, instead of relying on simulations which are
computationally intensive. This is a topic for further research.


In any case, given an estimate of the $e_m$ variances, the bias in $\chi^2$ statistics can be
corrected for. Instead of using $\chi^{2*}$, the user should use


\[
\chi^{2*} - \sum_{m=1}^{K(IJ-1)} \mathrm{Var}(e_m)
\]


(assuming this value is positive). There will also be a variance on this
difference. If this was known then an approximate conservative confidence interval
for Pearson’s $\chi^2$ statistic under perturbation could be calculated.
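

In practice the correction and the measure in (13) are simple to apply once the component variances have been estimated. The sketch below uses a hypothetical function name and made-up input values; it assumes the variances are supplied by the user, for example from a simulation.

```python
def bias_corrected_chi2(chi2_perturbed, component_variances):
    """Return (corrected chi-squared, estimated percentage increase as in (13)).

    Raises ValueError when the corrected value is not positive, since the
    correction and the percentage measure are only defined in that case.
    """
    total_variance = sum(component_variances)
    corrected = chi2_perturbed - total_variance
    if corrected <= 0:
        raise ValueError("corrected statistic is not positive")
    return corrected, 100.0 * total_variance / corrected


# Made-up inputs: a perturbed statistic of 120 and 20 components,
# each with an estimated noise variance of 1.
corrected, pct_increase = bias_corrected_chi2(120.0, [1.0] * 20)
```

With these inputs the corrected statistic is 100 and the estimated average inflation due to perturbation is 20 per cent.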


As we saw in Section 6, the $Y_{uvk}$ are also important because they determine the
magnitude and direction of an association. If users had an idea of the variance of
these terms under perturbation (they have already been found to be approximately
unbiased), then approximate confidence intervals for these could be calculated as
well. Confidence intervals are good information loss measures because their length
indicates the uncertainty about a parameter or statistic that we are introducing through
perturbation.


In contingency table analyses we are treating the census data as though they were a 
random sample from a superpopulation. We can hypothesise that the data was 
generated from some parametric model. In Beh and Davy (1999) a model of 
association is defined for a doubly ordered three way table as, 


I-1J-1 
Pik = DP; oor “Wy Yoru + 5 You (SJ) te REP (14) 
NEL Dk 


u=1 V"P_k v=1 V"P_k u=lv=1 


where under the null hypothesis of independence all the $Y_{uvk}$ are 0. When analysing 
our doubly ordered three way census table we could hypothesise and then assume 
that the census data were generated from a multinomial model with the above cell 
probabilities. We are then interested in estimating the set of superpopulation 
parameters $\{Y_{uvk}\}$ and the sets of nuisance parameters $\{p_{i\cdot\cdot}\}$, 
$\{p_{\cdot j\cdot}\}$ and $\{p_{\cdot\cdot k}\}$. The 
census data give us information about these parameters. To measure the amount of 
information about a particular parameter inherent in a sample we could use the 
expected Fisher information measure. See for example Mathai and Rathie (1975) or 
Azzalini (1996) for further details. This measure of information is defined as, 


\[ E\left\{ \left( \frac{\partial}{\partial \theta} \log L \right)^2 \right\}, \tag{15} \]


where $\theta$ is a particular parameter we are interested in estimating and $L$ is the 
likelihood. 
We can use Fisher's expected information to help us derive an information loss 
measure. Firstly, let's simplify the model defined at (14). Suppose we are interested 
in estimating the probability of a given population unit being in a particular cell in the 
contingency table, where the overall sample size is fixed to be $n$. We have a sample size of 
$N_1$ in this cell and a sample size of $n - N_1$ not in the cell. Denote the probability of 
being in the cell by $p_1$. Our aim is to estimate the superpopulation parameter $p_1$ and 
for simplicity we will assume here that the cell count $N_1$ follows a Binomial$(n, p_1)$ 
distribution. It can be easily proved that the maximum likelihood estimator of $p_1$ is 

\[ \hat{p}_1 = \frac{N_1}{n} \]

and the expected Fisher information is, 

\[ \frac{1}{\mathrm{Var}(\hat{p}_1)} \tag{16} \]

and (16) can be estimated using 

\[ \frac{n}{\hat{p}_1 (1 - \hat{p}_1)} \]
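
As a numerical check of (15) and (16) under the Binomial assumption, the following sketch evaluates the expected squared score by direct enumeration and compares it with the closed form $n / (p_1(1 - p_1))$. The function names and the example values of $n$ and $p_1$ are illustrative only.

```python
from math import comb


def expected_fisher_information(n, p):
    """E[(d/dp log L)^2] for N ~ Binomial(n, p), by direct enumeration."""
    total = 0.0
    for k in range(n + 1):
        prob = comb(n, k) * p ** k * (1 - p) ** (n - k)
        # Score: derivative of the Binomial log-likelihood with respect to p.
        score = k / p - (n - k) / (1 - p)
        total += prob * score ** 2
    return total


def estimated_fisher_information(cell_count, n):
    """Plug-in estimate n / (p_hat * (1 - p_hat)) with p_hat = N_1 / n."""
    p_hat = cell_count / n
    if not 0 < p_hat < 1:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    return n / (p_hat * (1 - p_hat))


n, p = 50, 0.2                            # hypothetical sample size and cell probability
info_enumerated = expected_fisher_information(n, p)
info_closed_form = n / (p * (1 - p))      # equals 1 / Var(p_hat)
print(info_enumerated, info_closed_form)
```

The two values agree up to floating-point error, which is the sense in which (16) is the reciprocal of the variance of the maximum likelihood estimator.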


Under perturbation, we add random noise to each cell. So instead of observing $N_1$ 
and $n$ we observe $N_1^* = N_1 + e_{I(1)}$ and $n^* = n + e_{P(S)}$, where $e_{I(1)}$ is the total additive 
perturbation for the interior cell and $e_{P(S)}$ is the consistent stage 1 perturbation added 
to the grand total cell ($e_g = 0$ is ensured in grand total cells). See Section 2 for a 
definition of these perturbations. Because of perturbation we can no longer estimate 
$p_1$ with $\hat{p}_1$. Instead we use 

\[ \hat{p}_1^* = \frac{N_1^*}{n^*} \]



which leads to some information loss and an increase in the variance. To get a 
measure of the information loss due to perturbation we could calculate the expected 
Fisher information using the joint probability of $N_1^*$ and $n^*$ and subtract this from 
(16). That is, we could calculate 

\[ \frac{1}{\mathrm{Var}(\hat{p}_1)} - E\left\{ \left( \frac{\partial}{\partial p_1} \log P(N_1^*, n^*) \right)^2 \right\}. \tag{17} \]


But (17) is difficult to compute. Instead, a rough measure of the amount of 
information loss due to perturbation, suggested in part by the form of (16), can be 
defined as 

\[ \text{Information loss} = \frac{1}{\mathrm{Var}(\hat{p}_1)} - \frac{1}{\mathrm{Var}(\hat{p}_1^*)} \tag{18} \]


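with the term $\mathrm{Var}(\hat{p}_1^*)$ in (18) approximated using a first-order Taylor series 
expansion. This approximation is, 

\[ \mathrm{Var}(\hat{p}_1^*) \approx \mathrm{Var}(\hat{p}_1) + \frac{\mathrm{Var}(e_{I(1)})}{n^2} + \frac{p_1^2 \mathrm{Var}(e_{P(S)})}{n^2} - \frac{2 p_1 \mathrm{Cov}(e_{I(1)}, e_{P(S)})}{n^2} \tag{19} \]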


The $\mathrm{Cov}(e_{I(1)}, e_{P(S)})$ term in (19) will be positive in general, and it enters (19) 
with a negative sign. An approximate conservative estimator of $\mathrm{Var}(\hat{p}_1^*)$ can 
therefore be found by setting $\mathrm{Cov}(e_{I(1)}, e_{P(S)}) = 0$ in (19). 
Assuming this, we can now calculate an estimate for the information loss using (19) 
and replacing $p_1$ with $\hat{p}_1^*$ and $n$ with $n^*$. This estimate is, 


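\[ \text{Information loss} \approx \frac{n^*}{\hat{p}_1^* (1 - \hat{p}_1^*)} - \left[ \frac{\hat{p}_1^* (1 - \hat{p}_1^*)}{n^*} + \frac{\mathrm{Var}(e_{I(1)}) + \hat{p}_1^{*2} \mathrm{Var}(e_{P(S)})}{n^{*2}} \right]^{-1} \tag{20} \]


assuming we have estimates of $\mathrm{Var}(e_{I(1)})$ and $\mathrm{Var}(e_{P(S)})$ available and $\hat{p}_1^*$ is non-zero 
(if $\hat{p}_1^* = 0$ we could replace it with a small value instead). We can also estimate the 
percentage information loss as, 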


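\[ \frac{\dfrac{1}{\mathrm{Var}(\hat{p}_1)} - \dfrac{1}{\mathrm{Var}(\hat{p}_1^*)}}{\dfrac{1}{\mathrm{Var}(\hat{p}_1)}} \times 100\% \approx \frac{\dfrac{1}{n^*} \left[ \mathrm{Var}(e_{I(1)}) + \hat{p}_1^{*2} \mathrm{Var}(e_{P(S)}) \right]}{\hat{p}_1^* (1 - \hat{p}_1^*) + \dfrac{1}{n^*} \left[ \mathrm{Var}(e_{I(1)}) + \hat{p}_1^{*2} \mathrm{Var}(e_{P(S)}) \right]} \times 100\% \tag{21} \]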


This suggests that the percentage information loss with respect to estimating a cell 
proportion decreases as $n$ increases. 
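
A minimal sketch of the calculation described above follows. It applies the first-order variance approximation with the covariance term set to zero, then forms the information loss and the percentage information loss. The perturbation variances of 4.0 and the cell proportion of 0.2 are hypothetical inputs, and the function names are illustrative only.

```python
def perturbed_proportion_variance(p, n, var_e_cell, var_e_total, cov=0.0):
    """First-order approximation to Var(p_hat*) as in (19)."""
    return (p * (1 - p) / n
            + var_e_cell / n ** 2
            + p ** 2 * var_e_total / n ** 2
            - 2 * p * cov / n ** 2)


def information_loss(p_star, n_star, var_e_cell, var_e_total):
    """Return (information loss, percentage information loss).

    Uses the conservative choice Cov = 0 and plugs the perturbed
    proportion and perturbed grand total into the variance formulas.
    """
    if not 0 < p_star < 1:
        raise ValueError("p_star must lie strictly between 0 and 1")
    var_plain = p_star * (1 - p_star) / n_star
    var_perturbed = perturbed_proportion_variance(
        p_star, n_star, var_e_cell, var_e_total)
    loss = 1 / var_plain - 1 / var_perturbed
    percent = 100 * loss * var_plain
    return loss, percent


# Hypothetical perturbation variances; same cell proportion, growing totals.
percent_losses = [information_loss(0.2, n, 4.0, 4.0)[1]
                  for n in (100, 1000, 10000)]
print(percent_losses)
```

With the perturbation variances held fixed, the percentage loss shrinks as the grand total grows, in line with the behaviour noted above.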


Now suppose that the original cell counts in our table are fixed and not generated 
from a superpopulation model. Instead of observing the fixed count $n_1$ in a particular 
cell we observe $n_1^* = n_1 + e_{I(1)}$. This leads to distortion of the original cell 
counts. If we had information available about the variance of the $e_I$ terms in each cell, 
then approximate confidence intervals for the original cell counts could be calculated, 
giving the user an idea of the amount of information loss and uncertainty introduced. 
This would help users make more informed decisions. A rough measure of the 
amount of information loss in a particular cell can be approximated using the 
coefficient of variation (by noting that $E(n_1^*) \approx n_1$), 

\[ \frac{\sqrt{\mathrm{Var}(e_{I(1)})}}{n_1} \tag{22} \]

or estimated using 

\[ \frac{\sqrt{\mathrm{Var}(e_{I(1)})}}{n_1^*} \]

assuming $n_1^* > 0$ and an estimate of $\mathrm{Var}(e_{I(1)})$ is available. If the coefficient of variation 
is large then the estimate $n_1^*$ of $n_1$ is considered unreliable and the amount of 
information loss relative to the cell size is large. 
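
The cell-level measure can be sketched as follows. The interval calculation additionally assumes that the perturbation is approximately unbiased and roughly normal, which is an assumption made here for illustration and not a property established in this paper. The variance value of 4.0 and the function names are hypothetical.

```python
import math


def cell_coefficient_of_variation(perturbed_count, var_e_cell):
    """Estimated coefficient of variation: sqrt(Var(e)) / n_1*."""
    if perturbed_count <= 0:
        raise ValueError("perturbed count must be positive")
    return math.sqrt(var_e_cell) / perturbed_count


def approximate_interval(perturbed_count, var_e_cell, z=1.96):
    """Rough interval for the original count, truncated below at zero.

    Assumption: unbiased, approximately normal perturbation noise.
    """
    half_width = z * math.sqrt(var_e_cell)
    return (max(0.0, perturbed_count - half_width),
            perturbed_count + half_width)


small_cell_cv = cell_coefficient_of_variation(5, 4.0)
large_cell_cv = cell_coefficient_of_variation(500, 4.0)
small_cell_interval = approximate_interval(5, 4.0)
print(small_cell_cv, large_cell_cv, small_cell_interval)
```

For the same perturbation variance, the small cell has a far larger coefficient of variation than the large cell, which matches the point that information loss relative to cell size is greatest for small counts.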


So far we have focused on the distortion to the original cell counts in particular cells. 
It may also be useful to publish overall measures of cell distortion for a table. Shlomo 
and Young (2005) describe various distance metrics that can be used for this purpose. 
For example, they apply Hellinger's distance, a relative absolute distance and an 
average absolute distance metric to measure distances between the original cell 
frequencies and the perturbed cell frequencies. See page 5 of Shlomo and Young 
(2005) for further details. 
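
As a rough illustration of table-level distance metrics of this kind, the sketch below uses common textbook forms of the three distances, computed on cell counts. The exact definitions and normalisations used by Shlomo and Young (2005) may differ, and the example counts are hypothetical.

```python
import math


def hellinger_distance(original, perturbed):
    """sqrt(0.5 * sum((sqrt(a) - sqrt(b))^2)) over non-negative cell counts."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(original, perturbed)))


def relative_absolute_distance(original, perturbed):
    """Sum of |b - a| / a over cells with a non-zero original count."""
    return sum(abs(b - a) / a
               for a, b in zip(original, perturbed) if a > 0)


def average_absolute_distance(original, perturbed):
    """Mean of |b - a| over all cells."""
    return sum(abs(b - a) for a, b in zip(original, perturbed)) / len(original)


original_counts = [4, 9, 16]      # hypothetical original cell counts
perturbed_counts = [4, 16, 9]     # hypothetical perturbed cell counts
hd = hellinger_distance(original_counts, perturbed_counts)
rad = relative_absolute_distance(original_counts, perturbed_counts)
aad = average_absolute_distance(original_counts, perturbed_counts)
print(hd, rad, aad)
```

Each function returns zero when the perturbed table equals the original, and larger values indicate greater overall distortion.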
8. CONCLUSION 


The ABS’ new cell perturbation methodology is designed to minimise information loss 
in tables subject to certain identification risk constraints. Parameter values associated 
with the method can be chosen to control to some extent both of these conflicting 
attributes. In this paper we empirically examined the distributional properties of the 
cell perturbations for two example tables and gained insights into how cell counts will 
get distorted under perturbation. We also examined the impact of perturbation on 
contingency table analyses and tests for these tables. 


We demonstrated that perturbation will lead to an upward bias in Pearson's $\chi^2$ 
statistic, and this result also applies to tables that have been subject to random 
rounding. This upward bias means that p-values under perturbation will on 
average be smaller, which may lead to a false impression of significance in $\chi^2$ tests of 
association. The bias was shown to be the sum of the variances of certain noise terms. 
So if estimates of these variances were available, users could adjust the $\chi^2$ 
statistic downward accordingly. 


As demonstrated by Beh and Davy (1999), components of the partitions of Pearson's 
$\chi^2$ statistic can be used to get an idea of the magnitude and direction of certain 
associations in a table on a category level. These associations are approximately 
unbiased under perturbation and, for tables with larger grand totals, their variance 
should be small. 


The earlier sections of this paper relied on results from a simulation study of two 
tables. Future work may involve looking at tables with different numbers of interior 
cells, grand totals, dimensions and additivity constraints. It may be possible to apply a 
model which predicts the perturbation variances given certain known table attributes. 
It would also be useful to look further at the correlation structure of components of 
the $\chi^2$ statistic under perturbation. This information is needed in order to determine 
approximate confidence intervals for this statistic and the other smaller partitions. 


Finally in this paper we demonstrated that there are a variety of different information 
loss measures that can be applied to perturbed tables. It was found that in general 
information loss will decrease as the grand total in a table increases. There is no one 
ideal measure of information loss and this attribute is often hard to measure. Further 
work needs to be done in this area to determine the best suite of information loss 
measures to communicate to users. 